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Abstract 

We developed a general framework for family-based imputation using single-nucleotide polymorphism data and 
sequence data distributed by Genetic Analysis Workshop 18. By using PedlBD, we first inferred haplotypes and 
inheritance patterns of each family from SNP data. Then new variants in unsequenced family members can be 
obtained from sequenced relatives through their shared haplotypes. We then compared the results of our method 
against the imputation results provided by Genetic Analysis Workshop organizers. The results showed that our 
strategy uncovered more variants for more unsequenced relatives. We also showed that recombination breakpoints 
inferred by PedlBD have much higher resolution than those inferred from previous studies. 



Background 

Next-generation sequencing (NGS) technologies have pro- 
foundly changed the landscape of genetic studies [1]. 
Although the cost of sequencing is becoming more afford- 
able, increasingly more studies are choosing NGS as the 
primary platform to collect data, either at the whole gen- 
ome level or for targeted regions. However, costs of 
sequencing thousands of individuals and the downstream 
analysis are still prohibitively high. On the other hand, 
many projects have already accumulated single-nucleotide 
polymorphism (SNP) data from previous studies. In such 
cases, researchers only need to sequence a small subset of 
family members (e.g., proband and parents) to reduce the 
costs. By jointly analyzing sequence data from a subset of 
family members together with SNP data from the families, 
computational approaches may fully recover variant infor- 
mation in unsequenced members. The data distributed by 
Genetic Analysis Workshop 18 (GAW18) provide an 
excellent example based on this design strategy. Many of 
the pedigrees are very large, and all of them have a signifi- 
cant number of members without SNP genotypes, which 
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makes the imputation computationally very challenging. 
Our laboratory has recently developed an efficient haplo- 
type inference algorithm called PedlBD, which is designed 
specifically for large pedigrees with many untyped indivi- 
duals [2] . By taking advantage of haplotypes inferred by 
PedlBD using SNP data, we developed a procedure to 
computationally impute variants for unsequenced indivi- 
duals based on haplotype sharing between them and their 
sequenced relatives. The advantage of our approach over 
the imputation provided by GAW lies in the fact that 
whereas our approach can take each pedigree as a whole 
when inferring haplotype or inheritance, GAW had to par- 
tition big pedigrees into smaller families. Our approach 
thus will provide more complete and more accurate 
results. In addition, based on the provided SNP data, we 
can also provide inferred recombination breakpoints with 
high resolution within each pedigree. 

Methods 

Data 

We focused our analysis on chromosome 3 of the GAW18 
dataset. The dataset consists of 1389 individuals from 20 
families. Among them, 959 individuals were genotyped 
using SNP chips. In addition, a subset of 464 genotyped 
individuals were also sequenced. The total number of SNPs 
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from the chip data is 65,519. Because only rs numbers of 
these SNPs were provided, we obtained their map positions 
from the NCBI dbSNP database (BuUd 37). Nineteen SNPs 
were removed because they either had no matched rs num- 
bers or the SNPs with the same rs numbers were mapped 
to a different chromosome. Sequence data was converted 
to A/C/G/T format using VCFtools [3]. The total number 
of SNPs called from sequence data is approximately 1.75 
million per individual. After removing SNPs with a high 
missing rate (>5%), the total number of sequence variants 
that used in our analysis is approximately 1.69 million. 

Analysis 

Our family-based imputation approach works in several 
steps. First, recombination breakpoints are inferred and 
haplotypes are assigned at each recombination-free 
segment for each individual with SNP chip data using 
PedlBD. Some individuals without chip data may also be 
assigned some unique haplotypes based on the inferred 
inheritance (e.g., untyped parents with typed children). 
Then, for newly discovered SNPs from sequenced indivi- 
duals, at each individual locus, the allele on a haplotype 
can be determined if a sequenced individual sharing the 
same haplotype is homozygous at this locus. After all 
homozygous SNPs have been processed, the information 
can be propagated to heterozygous SNPs if the allele on 



one haplotype has already been assigned. Genotypes of 
unsequenced individuals can then be imputed based on 
their assigned haplotypes (see Figure 1 for the frame- 
work and an example). Conflicts may occur when the 
algorithm tries to assign different allele types to the 
same haplotype. Conflicts reflect inconsistency between 
inferred inheritance from chip data and observed SNPs 
from sequence data. Although there is a possibility that 
the inferred inheritance could be wrong, a significant 
majority of conflicts are actually due to high genotype 
calling errors from sequence data. One should notice 
that under the assumption that genotyping errors are 
randomly distributed among all SNPs in sequenced 
individuals, the total number of loci with conflicts will 
be proportional to the number of SNPs as well as the 
number of individuals with genotyping errors even when 
the typing error rate is a constant. Therefore, the total 
number of loci with conflicts increases with the size of a 
pedigree and can thus be substantial in large pedigrees. 

Figure 2 shows the pedigree structure of family 21 and 
one haplotype segment inferred by PedlBD. There are sev- 
eral characteristics of the proposed algorithm. First, 
because information from the whole pedigree has been 
used, it is possible that haplotypes for individuals with no 
data at all can be recovered (e.g., individual 949). It is also 
possible that only one of the two haplotypes of an 
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Figure 1 The imputation frameworit (left) and an example illustrating the imputation procedure (right). In the example, individuals with 
grey color have single-nucleotide polymorphism (SNP) chip data from genome-wide association studies (GWAS), and individuals with black color 
have both chip data and sequence data. The haplotypes in this segment are labelled using different colors and they are inferred based on 
GWAS data. Notice that both haplotypes of individual 949 and one haplotype of individual 957 can be recovered based on the information of 
their children (the missed haplotype is illustrated using a thin black bar). However, only one haplotype can be recovered for 957 because he 
only has one child. The two variants are from sequence data (1 and 2 are alleles, and 0 is missing). For the first variant, because member 974 is 
homozygous genotype (1, 1), the alleles on its two haplotypes (pink and dark blue) can be assigned. Subsequently, the alleles on the light blue 
haplotype of member 940, the yellow haplotype of member 956, and the green haplotype of member 939 can be resolved (all three have 
sequence data). For all the other members, their alleles can be imputed based on the color of their haplotypes. However, haplotype light green 
(in members 949, 959, and 960) cannot be imputed because it has not occurred in any sequenced individual, thus showing missing one allele. 
For the second variant, our algorithm will identif/ a conflict because member 974 assigns allele 2 to the pink haplotype, and member 939 
assigns allele 1 to the pink haplotype. 
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Figure 2 The pedigree structure of family 21. The figure sliows the pedigree information of family 21 and one haplotype segment inferred 
by PedlBD for each individual. It also shows which individuals have been imputed and 5 masked individuals who selected for imputation 
accuracy with asterisks. The legends are the same as those in Figure 1 (right). 



individual with no data can be recovered (e.g., individual 
957). Second, loci with inconsistent genotypes called from 
sequence data can be identified (e.g., locus 2 in Figure 1, 
right). Third, at a variant locus identified from sequence 
data, if there are no sequenced individuals with homozy- 
gous genotypes, the phase at this position cannot be 
determined. 

However, because most new variants from sequence 
data are rare, the probability of having no homozygous 
genotypes is extremely low. 

For each offspring in a family, a switch on its haplotype 
assignment indicates a recombination event. We collected 
all recombination events on chromosome 3 and examined 
the resolution of recombination breakpoints. 

Results 

Inconsistency between single-nucleotlde polymorphism 
chip data from genome-wide association studies and 
sequence data 

Among 65,500 SNPs from genome-wide association stu- 
dies (GWAS) data, 63,803 of them were rediscovered 
from prefiltered whole genome sequencing (WGS) data. 
Because of high accuracy of chip data, we treated the 
genotypes from GWAS data as ground truth and first 
examined the SNP-calling accuracy of WGS data on this 
subset. Families 14, 15, 23, and 25 were excluded from 
our analysis because they did not have any sequenced 
individuals. The inconsistency rate is about 5.50% on 
average (Table 1). After eliminating families 7, 9, and 11, 
which had unusual high rates of missing in GWAS data, 
the inconsistency rate is about 2.25%. We anticipated the 
issue that the allele types from GWAS data and from 



sequence data may be encoded differently (i.e., different 
strands) and did not include discrepancies when alleles 
are A and T (or G and C). Among the inconsistent geno- 
types (excluding families 7, 9, and 11), 34.26% were 
caused by missing genotypes in GWAS, 60.17% were 
caused by missing in WGS, and the remaining (5.57%) 
were mismatches. The very high inconsistency for 
families 7, 9, and 11 was mainly caused by high missing 
rates of GWAS data in these families. The average miss- 
ing rate was 2.0% for WGS and 0.72% for GWAS data 
(excluding families 7, 9, and 11). Both measures indicate 
that for joint analysis of SNP and sequence data, one 
should not only impute variants in unsequenced or 
untyped individuals but also impute these missed or 
incorrectly called SNPs. For the remaining analysis, we 
have replaced the incorrect genotypes from sequence 
data using the genotypes from GWAS data. 

Comparison of imputation results between Genetic 
Analysis Workshop and our approach 

We compared our imputation results with the GENO 
dataset provided by GAW, which recovered 1.2 million 
variants for 813 individuals (including sequenced indivi- 
duals themselves). GAW took a 2-step procedure for 
imputation: a preliminary imputation based on popula- 
tion level information alone in the first step and an 
additional imputation procedure using pedigree informa- 
tion in the second step using SimWalk2 and Merlin 
[4,5]. Neither program can handle pedigrees as large as 
the GAW18 families, so both required large families to 
be partitioned into smaller subfamilies. In contrast, by 
taking each pedigree as a whole, our method was able 
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Table 1 Missing rate and Inconsistency between whole genome sequencing and genome-wide association studies 

Family ID Total number of Individuals Missing rate Genotype Inconsistency between GWAS and WGS 

All GWAS WGS GWAS WGS GWAS and WGS Cause of genotype Inconsistency 
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Asterisks (•) and Carets C^) indicate averaged numbers across families, with (*) and without C^) families 7, 9, and 11. 



to recover approximately 1.53 million SNPs for 1011 
individuals (these include an additional 198 individuals 
without sequence data or GWAS data), which accounts 
for 90.6% of total 1.69 million variants. Both the number 
of imputed SNPs and the number of imputed indivi- 
duals by our approach are substantially higher than 
those given by the GAW. For the 9.4% remaining 
variants, our imputation method found that 7.39% had 
conflicts that are similar to the one in Figure 1 (right, 
second locus), caused by calling errors from sequence 
data. About 2% of them were located between haplotype 
segments. For the rest 0.01%, all of the sequenced indi- 
viduals in each pedigree were heterozygous; therefore, 
genotypes of unsequenced individuals cannot be 
imputed. Among 1,070,318 common variants imputed by 
both methods for the 813 individuals, we found 0.15% of 
genotype mismatch between 2 sets. Our approach has 
imputed 467,485 more variants than the GENO dataset 
but missed 145,081 SNPs. The majority of the missed 
SNPs (>80%) are due to conflicts discovered by our pro- 
gram. This is consistent with the genotype-calling error 
rate from sequence data. The remaining SNPs were 
missed because their positions were out of haplotype 
segment regions. The summary of results can be found in 
Figure 3. 

Imputation accuracy 

We further evaluated the imputation accuracy by 
assuming sequence data of some individuals were 



unknown [6]. We selected 5 individuals from family 21 
(Figure 2 and Table 2). For each of them, we masked all 
of their genotypes (i.e., all genotypes were set to be 
"missing"), performed the imputation procedure, and 
then assessed the imputation accuracy as the proportion 
of correctly imputed alleles. Table 2 shows the pedigree 
relationship information for masked individuals and 
imputation accuracy. Each individual represents a 
distinct relationship within the family. Results show that 
imputation on individual 946 has the highest accuracy. 
Four of her close relatives (three children and one half- 
sibling) have been sequenced, and the average missing 
rate of the sequenced relatives was 2.13%. Most of her 
genotypes can be inferred because even if there is a 
missing variant in one of sequenced relatives at a locus, 
the other relatives may provide enough information to 
derive her genotypes. Individual 977 has the lowest 
accuracy, although both parents have been sequenced. 
Theoretically, one should be able to infer a child's geno- 
types from both parents if the inheritance is given. 
However, in this case, not only does she have a smaller 
number of sequenced relatives, but the missing rate of 
the father (4.17%) is also much higher, both of which 
contribute to the low accuracy. 

Recombination breakpoints 

The haplotypes and recombination breakpoints have been 
obtained from all families based only on GWAS data. 
Overall, there are a total of 3089 recombination events 
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Figure 3 Imputation results from our approach and data provided by Genetic Analysis Workshop (GAW). All numbers are averaged on 
individual. A) Imputation results of our method and GAW as the number of imputed variants and individuals. B) Overlaps between two sets. C) 
Itemized comparison between two sets. 



Table 2 Pedigree information of masked individual and imputation accuracy. 

Masked individual Pedigree information Accuracy 

ID (%) 



First-degree relationship Second-degree relationship 
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identified. Among them, a fraction still cannot be deter- 
mined from their parental sources because of missing 
genotypes in parents. After filtering out recombination 
events with unknown parental origins, our final dataset 
had 1361 maternal and 933 paternal recombination 
events. Because of homozygous genotypes, recombination 
breakpoints cannot always be within two adjacent SNPs. 
Still, the resolution of our inferred recombination break- 
points is very high, with more than 94% of them within 
20-kb range, and the median length is about 8 kb, which 
is a great improvement over previous results [7,8]. 

Discussion 

In this study, we have proposed a computational frame- 
work to infer haplotypes and recombination breakpoints 
and finally impute genotypes based on a subset of 
sequenced members in a pedigree. Results on GAW18 
data have demonstrated that (a) our approach is efficient 
for extremely large pedigrees and (b) we imputed more 



variants and more individuals than the one provided by 
GAW organizers. 

Our approach can be further improved in several 
directions. First, data quality, including missing and gen- 
otyping errors, can have a substantial effect on the final 
results. Many genotyping errors are actually Mendelian 
consistent, which makes error detection a challenging 
task. With the development of sequencing technologies 
as well as SNP calling algorithms, we expect the quality 
of genotyping calling from sequence data will improve, 
which in turn will improve our imputation results 
(e.g., reduce the number of conflicting loci). Second, 
given the high density of SNPs, population-level linkage 
disequilibrium can be used in imputation even for family 
data. Investigating approaches that can jointly consider 
information within families and information at the popula- 
tion level will be our future work. Third, our haplotype 
segments are defined based on all observed recombination 
events in a family. Therefore, the haplotype segments of a 
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particular individual may have been cut short unnecessa- 
rily from recombination breakpoints of other individuals, 
resulting in some variants between haplotype segments 
being dropped. We will define haplotype segmentations of 
each individual based on her or his own recombination 
breakpoints, which will reduce the number of dropped 
variants. Last, our results show that the strategy of sequen- 
cing only a small subset of family members and imputing 
others is very effective. However, the final imputation 
results may depend on many factors, such as number and 
type of relationships of sequenced relatives, as well as the 
quality (e.g., missing rate) of sequence data. A truly impor- 
tant decision is how researchers select individuals to 
sequence to optimize the amount of information acquired 
within the constraints of a budget. 
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